Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2

Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, is advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based and to date, there is little focus on reference-free transcriptome assembly. We introduce “RNA-Bloom2 [https://github.com/bcgsc/RNA-Bloom]”, a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive to those of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.


Supplementary Tables
. Effect of digital normalization on reads aligned against assembly. Assemblies were performed with and without hybrid correction using Illumina reads. The percentage of long reads remaining after digital normalization is equal to the number of long reads remaining after digital normalization divided by the total number of long input reads to the assembly. Long input reads are aligned to each assembly with minimap2 to calculate the percentage of reads aligned (See Supplementary Method 3).

Supplementary Table 13. Expression of genes discarded and retained after error correction and digital normalization.
Gene expression was quantified by Trans-NanoSim using raw reads from ONT cDNA, ONT dRNA, and PacBio CCS. Error correction was performed using either long reads only or a hybrid of long and short reads. The number (N) of genes retained or discarded after error correction and digital normalization are reported. The minimum (min), lower whisker, first quartile (Q1), median, third quartile (Q3), upper whisker, and maximum (max) expression levels in transcripts per million (TPM) for the corresponding genes are reported.

Supplementary Figures
Supplementary Figure 1. Comparing RNA-Bloom2 and isONcorrect on simulated data. The number of complete transcripts and false discoveries detected in RNA-Bloom2 assemblies and isONcorrect error-corrected reads of two-million simulated cDNA and dRNA read data. The features of these simulated data are described in Supplementary Table 5. The arrows indicate the magnitude of improvement offered by RNA-Bloom2 compared to isONcorrect. Source data are provided as a Source Data file.

Supplementary Figure 2. Expression-stratified assembly recall on simulated cDNA data.
Recall is categorized based on transcript reconstruction levels: missing, partial, and complete. Transcript expression levels are split into quartiles: low, medium-low, medium-high, and high. Source data are provided as a Source Data file.

Supplementary Figure 3. Expression-stratified assembly recall on simulated dRNA data.
Recall is categorized based on transcript reconstruction levels: missing, partial, and complete. Transcript expression levels are split into quartiles: low, medium-low, medium-high, and high. Source data are provided as a Source Data file.

Supplementary Figure 4. K-mer multiplicity threshold for error correction.
In each tile of a read, the k-mer multiplicity threshold for distinguishing strong and weak k-mers is dynamically set to the maximum of the fixed global threshold and the local threshold. The global threshold is defaulted to 2, unless specified otherwise by the`-c`option in RNA-Bloom2.
To calculate the local threshold (M local ) for a tile, all k-mer multiplicity values within the tile are sorted in descending order and two consecutive values are evaluated at a time to search for the local threshold. When the second value is less than half of the first value, the local threshold is set to the first value and the search terminates. If no such local threshold is found, then the k-mer multiplicity threshold is set to the global threshold.

Supplementary Figure 5. Read depth tracking with strobemers.
Read A has an overlapping chain of strobemers with multiplicities at least 3, which implies that the other reads previously kept would already span across read A at least 3 times. Therefore, read A will not be kept during digital normalization. Since the middle region of read B is not spanned by any strobemers with multiplicity at least 3, the target depth of 3 has not been reached for this region. Therefore, read B will be kept during digital normalization.

Supplementary Figure 6. Trimming and splitting based on read depth.
The read depth across a given read can be tallied using read-to-read overlaps. Low-depth regions on both ends are trimmed. A read may also be split at internal low-depth region(s) into shorter segments.  Error rates are reported in`training_error_rate.tsv`.

Supplementary Method 5. Extraction of spike-in control reads.
All three replicates of the mouse ES sample from LRGASP were used for each platform: For ONT cDNA data, adapter-trimmed reads ("fulllength" and "rescued" FASTQs from Pychopper) are used (See Supplementary Method 1). For ONT dRNA and PacBio CCS data, raw reads are used.
Machine specification for initial runs: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 48 CPUs, 377 GB RAM Machine specification for re-running RATTLE and FLAIR: Intel(R) Xeon(R) CPU E7-8867 v3 @ 2.50GHz, 128 CPUs, 2.5 TB RAM All software programs are run with 48 threads whenever possible. For ONT direct RNA data, the option`--rna`is included for the`cluster`and`polish`modules. For replicated PacBio data, the option`--iso`is removed for the`cluster`module. For ONT direct RNA data, the option`--nvrna`is included for the`align`and`correct`modules.

Supplementary Method 8. Edge filtering in the overlap graph.
Each edge in the overlap graph represents the dovetail overlap between two sequences (i.e. the two incident nodes of the edge). An edge can be evaluated based on: R = the length distribution for all corrected reads s = the size of the overlap between sequences 1 and 2 n = the number of reads spanning across the overlap entirely c 1 = the length-normalized read count of sequence 1 c 2 = the length-normalized read count of sequence 2 R is extracted during the error correction stage. s is defined for each edge when the overlap graph is constructed according to the overlaps between reads. n, c 1 , and c 2 are determined by parsing the read alignments from the polishing stage. The length-normalized read count for a sequence is calculated as the total number of aligned bases in the aligned reads divided by the length of the sequence.
If n > min(c 1 , c 2 ), then the edge is kept. Otherwise, a one-tailed binomial test is applied to evaluate the statistical significance of observing this edge. R is used to calculate, q, the probability of observing a read shorter than s. From the binomial distribution, B(min(c 1 , c 2 ), 1 -q), the p-value for n is calculated. The edge is removed if p-value < 0.001.

Supplementary Method 11. Running SQANTI3 v5.1
For RNA-Bloom2 and RATTLE: For simulated data, options`--CAGE_peak`and`--polyA_motif_list`are removed and a filtered annotation GTF is used for each set of reads corresponding to the isoforms that are simulated. For runs on mouse data replicates, the complete reference annotation GTF is used.